Abilis Accounting CDRs

Leo Mantovani

Revision History
Revision 4	2019-06-12	LM
Changed disk states WORKING/FAILED with UP/DOWN.
Revision 3	2019-05-27	LM
Added milliseconds in alert/connect time. Added traps and mails section. Added disks verifications and recoveries.
Revision 2	2019-05-07	LM
Changed values for state_reached, in_type, out_type. Defined fraction of seconds in timestamps to be of 3 digits always present, see Date and Time representation.
Revision 1	2019-04-24	LM
Now ALL fields of CDR are defined and basic Abilis configuration requirement shown.
Revision 0	2019-04-03	LM
Document created.

Table of Contents

1. Introduction

2. Date and Time representation

3. Overview

4. Files

4.1. Creation of the sequence number

5. Disks verifications and recoveries

5.1. Disk in DOWN (failed) state

6. Traps and mails

7. CDR fields

7.1. CDRs for transferred/merged calls

8. Abilis configurations

8.1. FTP configuration
8.2. SIP configuration

9.

Bibliography

Procedures to store and retrieve Call Data Records (CDRs) for Abilis as SIP exchange for up to 4000 calls.

Abbreviations

1. Introduction

For every terminated call Abilis must store a record, called a CDR, containing the information that is relevant for billing.

There are core aspects that are relevant:

local storage on dedicated high performance disks (SSD disks)
redundant storage (two disks)
record validity check (hash)
textual representation (csv)
apply all reasonable actions to protect data correctness and integrity, considering the Abilis platform characteristics and limitations (FAT file system)
simple methods for data retrieval from the external client application (ftps or https)
records added when call is terminated
regular shutdowns (e.g. warm start) must force call termination and safe CDR saving
handling of CDRs for ongoing calls dropped by an unexpected system reboot (e.g. power failure) is left to a second step, but its behaviour must be defined and described.

2. Date and Time representation

We choose the W3C profile of ISO 8601 representation (see Bibliography for external links with full details).

A concise summary is given below.

	Warning
	The W3C profile of ISO 8601 says that the application should define the maximal number of digits for fractional part of second. Abilis goes further and specifies a fixed number of 3 decimals, always present.

Complete date plus hours, minutes, seconds and a decimal fraction of a second
      YYYY-MM-DDThh:mm:ss.sssTZD (eg 1997-07-16T19:20:30.450+01:00)

where:

     YYYY = four-digit year
     MM   = two-digit month (01=January, etc.)
     DD   = two-digit day of month (01 through 31)
     hh   = two digits of hour (00 through 23) (am/pm NOT allowed)
     mm   = two digits of minute (00 through 59)
     ss   = two digits of second (00 through 59)
     sss  = three digits representing milliseconds
     TZD  = time zone designator (Z or +hh:mm or -hh:mm)
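
As an illustration of the format above, here is a minimal Python sketch that renders a timestamp with exactly three fractional digits. The function name `format_cdr_timestamp` is hypothetical, and two choices are our assumptions rather than part of this specification: milliseconds are truncated (not rounded), and a zero offset is written as `Z` instead of `+00:00`.

```python
from datetime import datetime, timedelta, timezone


def format_cdr_timestamp(dt: datetime) -> str:
    """Render dt as YYYY-MM-DDThh:mm:ss.sssTZD with 3 fractional digits."""
    offset = dt.utcoffset()
    if offset is None:
        raise ValueError("timestamp must carry a time zone")
    base = dt.strftime("%Y-%m-%dT%H:%M:%S")
    # Assumption: truncate to milliseconds so the seconds field never changes.
    millis = dt.microsecond // 1000
    if offset == timedelta(0):
        tzd = "Z"  # assumption: zero offset is written as Z
    else:
        total_minutes = int(offset.total_seconds()) // 60
        sign = "+" if total_minutes >= 0 else "-"
        hh, mm = divmod(abs(total_minutes), 60)
        tzd = f"{sign}{hh:02d}:{mm:02d}"
    return f"{base}.{millis:03d}{tzd}"
```

The fraction is always emitted, also when it is zero, which is what the warning above requires.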

3. Overview

The most critical element for storage in Abilis is that the only supported file system is FAT, which has a series of well-known weaknesses when files are created, extended or deleted. By contrast, renaming and moving files does not risk corrupting the FAT even in a sudden failure, and there are ways to handle it after a reboot.

To greatly reduce the risk of FAT corruption and consequent data loss, we have already successfully adopted the trick of creating files of preallocated size and never adding, extending or deleting files, or at least minimizing such operations. We will follow this idea for CDR recording.

Another aspect to follow is the concept of "write and forget" on a per-file basis, i.e. once CDRs are written they won't be touched anymore.

Since Abilis can't guarantee medium to long term storage, the idea is to make CDR files available at short intervals (e.g. 15 minutes), so that the billing system can frequently retrieve the CDRs.

To further add a protection level we will store the CDRs on two independent disks, so that failure of one disk will not stop the system, and the presence of a record hash will permit detection of data corruption and recovery from the other non-corrupted disk. We will have a fault only if both disks are corrupted.
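
The detection and recovery idea can be sketched as follows in Python. This is only an illustration under stated assumptions: the hash is MD5 over the field values with one SPACE for each empty field, as listed in the CDR fields chapter, while the concatenation without separator, the UTF-8 encoding and the lowercase hexadecimal output are our assumptions. The names `cdr_hash`, `is_valid` and `pick_valid` are hypothetical.

```python
import hashlib
from typing import Optional, Sequence, Tuple

# (stored hash, remaining field values) as read from one disk
Record = Tuple[str, Sequence[str]]


def cdr_hash(fields: Sequence[str]) -> str:
    """MD5 over the field values; an empty field counts as one SPACE."""
    padded = [value if value else " " for value in fields]
    # Assumption: values are concatenated in order, with no separator.
    return hashlib.md5("".join(padded).encode("utf-8")).hexdigest()


def is_valid(record: Record) -> bool:
    """True when the stored hash matches the recomputed one."""
    stored, fields = record
    return stored.lower() == cdr_hash(fields)


def pick_valid(copy_a: Optional[Record], copy_b: Optional[Record]) -> Optional[Record]:
    """Return the first copy whose hash verifies, or None if both fail."""
    for candidate in (copy_a, copy_b):
        if candidate is not None and is_valid(candidate):
            return candidate
    return None
```

A `None` result corresponds to the fault case in which both disks hold a corrupted copy.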

We will add alarms, traps, mails, for the relevant conditions.

We can describe the core aspects of CDR recording and handling as follows:

the system works with a set of preallocated files of fixed size, filled with ctrl-z character.
the number of files needed is determined automatically from MAX-AGE, FILE-SIZE, INTERVAL parameters.
the number of files will be increased ONLY if the actual need demands more files; this will be done in bursts to minimise the "time window of risk", and "in advance" with respect to the actual need.
the file(s) of the current time interval will not be available for retrieval.
when the current time interval terminates or the file is full, the current file is made available for retrieval, and from this moment on Abilis will not touch it until it needs "cleanup".
when a file has to be cleaned (MAX-AGE expiry or .INV extension) it is wiped with ctrl-z character.
filenames and file extensions will play a key role in procedures
client will use FTP(S) to retrieve list of files and then fetch the desired one
client can't delete files, but it can rename a file to request Abilis to wipe it with the ctrl-z character, basically to comply with privacy needs
the same files will be made available on two disks and client will access them independently
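
The preallocation idea can be sketched as follows. This is a minimal Python illustration, not the Abilis implementation: the helper names `preallocate`, `append_record` and `used_bytes` are hypothetical, and the computation of the number of files from MAX-AGE, FILE-SIZE and INTERVAL is deliberately left out.

```python
import os

CTRL_Z = b"\x1a"  # filler character used for preallocated and wiped files


def preallocate(path: str, size: int, chunk: int = 64 * 1024) -> None:
    """Create a file of exactly `size` bytes, all set to ctrl-z."""
    block = CTRL_Z * chunk
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(block[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())


def append_record(path: str, offset: int, data: bytes) -> int:
    """Overwrite filler at `offset` without ever growing the file."""
    if offset + len(data) > os.path.getsize(path):
        raise ValueError("record does not fit: file is full")
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    return offset + len(data)


def used_bytes(path: str) -> int:
    """Number of bytes before the first ctrl-z filler byte."""
    with open(path, "rb") as f:
        data = f.read()
    first = data.find(CTRL_Z)
    return len(data) if first < 0 else first
```

Because records only overwrite filler bytes, the file size and its FAT allocation never change after creation, which is the point of the approach.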

4. Files

Files are created in the working directory and ftp/http must be properly configured to publish such files with the read/rename permission only.

To leave Abilis in full control of aging, it is enough to grant read permission only.

Files are created in the working directory (e.g. E:\CDR\ and F:\CDR\)
Files are created with the "hidden" attribute, which is removed when the file is renamed to the .CSV extension
Extensions and their meaning

.FRE

File free for later use. Hidden.

The filename is not relevant. It will likely have some kind of progressive number or have some "old" filename.

.CUR

File of the current period, opened for exclusive write. Hidden.

The filename contains date, daily sequence number, local time of the beginning of the period it refers to.

.NXT

File for the next period, opened for exclusive write. Hidden.

The filename contains date, daily sequence number, local time of the beginning of the period it refers to.

.CSV

File containing valid data. Not hidden.

The filename preserves date, daily sequence number, local time of the beginning of the period it refers to.

.INV

File invalidated by client. Not hidden.

The client can change an extension from CSV to INV. Periodically Abilis will check for the presence of INV files, wipe them with ctrl-z and return them to .FRE

Shall we include a protection to not wipe files before a certain age? E.g. guarantee that data is not wiped for 1 week?
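
The periodic check for .INV files could look like the following Python sketch. It is a simplified illustration under our own assumptions: the function name `wipe_invalidated` is hypothetical, the "hidden" attribute is not handled because it is platform specific, and no minimum-age protection is applied since that point is still open.

```python
import os

CTRL_Z = b"\x1a"


def wipe_invalidated(directory: str, chunk: int = 64 * 1024) -> list:
    """Overwrite each .INV file with ctrl-z, rename it to .FRE, return the names."""
    wiped = []
    for name in sorted(os.listdir(directory)):
        stem, ext = os.path.splitext(name)
        if ext.upper() != ".INV":
            continue
        path = os.path.join(directory, name)
        remaining = os.path.getsize(path)
        # Open in place ("r+b") so the file size and allocation do not change.
        with open(path, "r+b") as f:
            while remaining > 0:
                n = min(chunk, remaining)
                f.write(CTRL_Z * n)
                remaining -= n
            f.flush()
            os.fsync(f.fileno())
        # The rename happens only after the wipe has reached the disk.
        os.rename(path, os.path.join(directory, stem + ".FRE"))
        wiped.append(name)
    return wiped
```

Wiping first and renaming afterwards means that an interruption leaves a file that is still marked .INV and is simply wiped again at the next check.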
Filenames and their meaning

YYYYMMDD-SSSS-hhmm.

YYYY

4 digit year (e.g. 2019)

MM

2 digit month (e.g. 04)

DD

2 digit day (e.g. 09)

SSSS

4 digits daily sequence number (e.g. 0007)

hh

2 digit hour (e.g. 18)

mm

2 digit minutes (e.g. 15)

Date and time are in UTC, to avoid issues with STD<>DST changes

The presence of UTC time in the filename has the main purpose to avoid file name repetition if the sequence number has to be reset for some reason, e.g. if disks are replaced or reformatted.

It can also be used to quickly identify the period to which the file refers, mainly for some kind of manual inspection.
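
A minimal Python sketch of building and parsing such filenames follows. It uses the 4-digit sequence number given in the field description above. The helper names are hypothetical, and aligning the periods to midnight UTC is our assumption, since this document does not state how intervals are aligned.

```python
import re
from datetime import datetime, timezone

# .FRE is excluded on purpose: its filename is not relevant.
NAME_RE = re.compile(r"^(\d{8})-(\d{4})-(\d{4})\.(CUR|NXT|CSV|INV)$")


def period_start(now: datetime, interval_min: int) -> datetime:
    """Floor a time to the start of its interval (assumed aligned to 00:00 UTC)."""
    now = now.astimezone(timezone.utc)
    minutes = now.hour * 60 + now.minute
    start = minutes - minutes % interval_min
    return now.replace(hour=start // 60, minute=start % 60, second=0, microsecond=0)


def build_name(start: datetime, seq: int, ext: str) -> str:
    """Compose YYYYMMDD-SSSS-hhmm.EXT from a UTC period start."""
    return f"{start:%Y%m%d}-{seq:04d}-{start:%H%M}.{ext}"


def parse_name(name: str):
    """Return (date, sequence, hhmm, extension) or None if the name does not match."""
    m = NAME_RE.match(name.upper())
    if m is None:
        return None
    return m.group(1), int(m.group(2)), m.group(3), m.group(4)
```

For example, with a 15-minute interval a call ending at 18:22 UTC on 2019-04-09 belongs to the period starting at 18:15, so sequence number 7 gives `20190409-0007-1815.CUR`.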

4.1. Creation of the sequence number

The sequence number MUST be guaranteed to be progressive within the day as far as possible; therefore here is the precise list of actions to be performed:

when CDR service starts (at boot or after CDR-ACT:NO->YES)
- scan both disks and apply the verification and, if needed, the corrective actions described in the specific chapter.
- if a .CUR file is present
  - if its YYYYMMDD is equal to the current UTC date, its sequence number must be set as the one "in use", and regular processing starts (if the interval is over, the file will be immediately closed to .CSV and a new .CUR file started).
  - if its YYYYMMDD is NOT equal to the current UTC date, it must be closed and renamed to .CSV, and processing proceeds as if the .CUR file was not present
- if a .CUR file is NOT present
  - scan all .CSV and .INV files having in the filename the YYYYMMDD of the current UTC date to detect the latest sequence number
  - start a new .CUR with the next sequence number
during normal CDR service operation the current sequence number is stored and increased in memory, without the need to scan all files present on disks when a new interval starts.

Of course the CDR service is in charge of the integrity of the sequence number, and of its verification or regeneration when the conditions suggest that it is no longer reliable (e.g. service restart, internal failures, ...).
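
The startup decision described above can be sketched in Python as follows. This is an illustration only: the function name `startup_sequence` is hypothetical, the disk verification step is not covered, and the first sequence number of a day (parameter `first_seq`, defaulting to 1) is our assumption because this document does not define it.

```python
import re

NAME_RE = re.compile(r"^(\d{8})-(\d{4})-\d{4}\.(CUR|CSV|INV)$", re.IGNORECASE)


def startup_sequence(filenames, today, first_seq=1):
    """Return (sequence, resume, stale_cur) for the UTC date `today` (YYYYMMDD).

    resume is True when a .CUR file of today exists and its number stays in use.
    stale_cur lists .CUR files of other days, to be closed and renamed to .CSV.
    """
    cur_today = None
    latest_closed = None
    stale_cur = []
    for name in filenames:
        m = NAME_RE.match(name)
        if m is None:
            continue
        date, seq, ext = m.group(1), int(m.group(2)), m.group(3).upper()
        if ext == "CUR":
            if date == today:
                cur_today = seq
            else:
                stale_cur.append(name)
        elif date == today:
            # .CSV and .INV files of today: remember the highest number seen
            latest_closed = seq if latest_closed is None else max(latest_closed, seq)
    if cur_today is not None:
        return cur_today, True, stale_cur
    next_seq = first_seq if latest_closed is None else latest_closed + 1
    return next_seq, False, stale_cur
```

After startup the number is simply incremented in memory at each new interval, so the directory scan is needed only here.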

5. Disks verifications and recoveries

During normal work there are various disk activities, like listing, read, write, and so on, and every file I/O operation can terminate with an error code.

The action to be taken depends on the error code and on the procedure affected.

We can basically identify the following error typologies:

Procedural errors

These errors occur when the conditions are not as expected.

A typical example is creating a file that already exists, or opening for reading a file that does not exist.

In these situations it's the procedure itself that must take the corrective action.

System errors

These errors occur when the operating system or the file system meets some particular software conditions. An example is the "too many files opened" error.

These errors could be limited to individual files, so some part of the procedure can continue to work while the other stops.

In these situations it's the procedure itself that must take the corrective actions, including a carefully designed wait-and-retry attempt, up to the handling of the persistent condition.

Disk errors

These errors occur when the disk is having some defect.

Since disk defects can be of various types, the reactions must also be of various types.

We can basically identify the following errors:

Limited Read/write errors

Typical examples are "bad sector" or "data error". These errors are "limited" in the sense that only a portion of the disk is affected, leaving other parts functional.

How a fully automated procedure deals with these errors is procedure dependent: it may range from blocking the procedure to proceeding while discarding unreadable data. Mind that these errors often occur after several retries, and thus with a non-negligible delay. This delay must be taken into consideration when there are serialized actions.

As an example, in the ACNT-CDR environment, writing to a .CUR file that shows this problem could be solved by forcibly closing the .CUR file, trying to go on with a new .FRE file, and sending an alert.

Persistent errors

These errors occur when the error may persist for a long time, until some event occurs or some action is taken.

Some of these events can be considered "normal" temporary failures (e.g. reset of a disk drive, removal/insertion of the disk, etc.), some are more "abnormal" and it is not known if and when they could be solved (e.g. device timeout, drive not found, etc.).

Errors that carry large delays, like "device timeout", must be carefully considered because the delay may disrupt all the subsequent sequential activities.

In ACNT-CDR, for example, since the writes to both disks are done sequentially in the same thread, a large timeout wasted on one disk can disrupt the behaviour of the second disk, defeating the high reliability of the two-disk approach.

Permanent errors

These errors occur when the error will never be recovered until some administrative action is taken.

A typical example is "fat corrupted".

These errors necessarily require some special and specific intervention. The functionality can be resumed only after the recovery activities have been carried out.

As said, procedural errors and system errors have to be handled in the most reasonable way by the procedure itself.

Focusing on disk errors we can identify important healthy actions for ACNT-CDR.

Limited read/write errors

The retries should already have been performed by the filesystem itself, so in general trying again would not succeed.

Let's not forget that magnetic HDs are more prone to this kind of error than SSDs.

The reaction depends on which action is affected; for example I can identify:

Read/Write to CUR

Try close CUR and restart with a new FRE. Send an alarm.

Also a simpler "Set disk to DOWN (failed) and stop activities on this disk." is probably acceptable.

Other actions

Set disk to DOWN (failed) and stop activities on this disk.

Persistent errors

Set disk to DOWN (failed) and stop activities on this disk.

Permanent errors

Set disk to DOWN (failed) and stop activities on this disk.
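
The reactions listed above can be summarised as a small decision function. The Python sketch below is only a restatement of that list; the enum and function names are hypothetical, and it picks the "close CUR and restart" variant for limited errors on the .CUR file rather than the simpler alternative.

```python
from enum import Enum


class DiskError(Enum):
    LIMITED = "limited read/write error"
    PERSISTENT = "persistent error"
    PERMANENT = "permanent error"


class Action(Enum):
    ROTATE_CUR = "close CUR, restart with a new FRE, send an alarm"
    DISK_DOWN = "set disk to DOWN (failed) and stop activities on this disk"


def choose_action(error: DiskError, on_cur_file: bool) -> Action:
    """Map a disk error class and the affected operation to a reaction."""
    if error is DiskError.LIMITED and on_cur_file:
        return Action.ROTATE_CUR
    # Every other combination stops the use of the affected disk.
    return Action.DISK_DOWN
```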

5.1. Disk in DOWN (failed) state

When a disk is in DOWN state:

It should be skipped in any disk activity, because it can damage the activities on the other disk.
A separate procedure in a separate thread should be started to periodically "probe" whether the disk has returned to a functional state. The period can be tuned; something like "every 30 seconds" could be a good starting point.
When the DOWN (failed) condition is recovered, and the recovery is somehow proven to be reliable, the state is changed to IN-USE and activities are restarted.
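
A possible shape for the probing procedure is sketched below in Python. It is an illustration under our own assumptions: the function name `wait_until_recovered` is hypothetical, `probe` stands for whatever disk test is chosen, and "proven to be reliable" is interpreted here as a number of consecutive successful probes (`required_ok`), which this document does not specify.

```python
import time
from typing import Callable, Optional


def wait_until_recovered(
    probe: Callable[[], bool],
    period_s: float = 30.0,
    required_ok: int = 3,
    sleep: Callable[[float], None] = time.sleep,
    max_probes: Optional[int] = None,
) -> bool:
    """Probe a DOWN disk periodically; True once it looks reliably recovered."""
    consecutive = 0
    probes = 0
    while max_probes is None or probes < max_probes:
        probes += 1
        try:
            ok = bool(probe())
        except OSError:
            ok = False  # an I/O error during the probe counts as a failure
        consecutive = consecutive + 1 if ok else 0
        if consecutive >= required_ok:
            return True
        sleep(period_s)
    return False
```

The function is meant to run in its own thread, for example via `threading.Thread(target=...)`, so that a slow or hanging probe never delays the writes to the other disk.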

6. Traps and mails

The following conditions are identified for alarm notifications, traps and mails.

Change on CDR-STATE (INACTIVE, UP, DOWN).
Change on Dx-STATE (UNUSED, UP, DOWN).
FIFO-CUR > 50% FIFO-SIZE (reset when FIFO-CUR=0)
FIFO-LOST (reset when FIFO-CUR=0)
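
The FIFO-CUR condition behaves as an alarm with hysteresis: it is raised above 50% of FIFO-SIZE and reset only when the FIFO is empty. A minimal Python sketch follows, where the class name `FifoAlarm` and the "raise"/"clear" return values are hypothetical.

```python
from typing import Optional


class FifoAlarm:
    """Raise when FIFO-CUR > 50% of FIFO-SIZE, clear only when FIFO-CUR is 0."""

    def __init__(self, fifo_size: int) -> None:
        self.fifo_size = fifo_size
        self.active = False

    def update(self, fifo_cur: int) -> Optional[str]:
        """Return "raise" or "clear" on a state change, otherwise None."""
        if not self.active and fifo_cur * 2 > self.fifo_size:
            self.active = True
            return "raise"
        if self.active and fifo_cur == 0:
            self.active = False
            return "clear"
        return None
```

Resetting only at zero avoids a burst of traps and mails when the FIFO level oscillates around the 50% threshold.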

7. CDR fields

List for discussion.

hash

md5 (should we use a different hash algorithm?)

always present

It is computed on field values and for empty fields one character SPACE (0x20) is used.

state_reached

routing, alerting, connected

always present

routing_time

YYYY-MM-DDThh:mm:ss.sssTZD

always present

alert_time

YYYY-MM-DDThh:mm:ss.sssTZD

present if alert state reached, otherwise empty

connect_time

YYYY-MM-DDThh:mm:ss.sssTZD

present if connect reached, otherwise empty

disconnect_time

YYYY-MM-DDThh:mm:ss.sssTZD

always present

alert_duration

ss (seconds)

or ss.sss (sec.msec)

or 0

connect_duration

ss (seconds)

or ss.sss (sec.msec)

or 0

direction

inout, output, local, transfer (the rules probably need deeper investigation for the public exchange environment)

in_type

ctip, clus(ter), sip, iax, disa, vo, vm, mix, sl

in_value

x, clusname, username

out_type

ctip, clus(ter), sip, iax, disa, vo, vm, mix, sl

empty if no routing found

out_value

x, clusname, username

empty if no routing found

cdi_ton